
Conversation

@MasterPtato
Contributor

No description provided.

Contributor Author

MasterPtato commented Jan 13, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite.

@pkg-pr-new

pkg-pr-new bot commented Jan 13, 2026


@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3862

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3862

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3862

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3862

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3862

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3862

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3862

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@3862

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3862

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3862

commit: 4a1bc29

@claude

claude bot commented Jan 13, 2026

PR Review: fix(guard): clean up guard

This PR removes the request context tracking system and simplifies guard architecture by removing per-actor tracking. Overall, the changes improve code maintainability by removing unused functionality. Here are my findings:

Strengths

  1. Significant code reduction - Removes 184 lines from request_context.rs and 199 net lines from proxy_service.rs, improving maintainability
  2. Cleaner API surface - The CustomServeTrait now uses explicit ray_id/req_id parameters instead of a stateful RequestContext
  3. Simplified rate limiting - Changed from per-actor rate limiting to per-IP, which is more straightforward
  4. Removed unused ClickHouse dependency - Eliminates the analytics insertion code that was not being used

Code Quality Issues

  1. Missing error context in StandaloneCtx::with_ray (gasoline/src/ctx/standalone.rs:70)

    • The new with_ray method calls StandaloneCtx::new(...)? but does not add context to the error
    • Should use .context for better debugging (see the sketch after this list)
  2. Inconsistent metric labels (guard/src/routing/mod.rs)

    • Line 130: Uses uppercase GATEWAY
    • Line 147: Uses uppercase API
    • Lines 75, 85, 114: Use lowercase gateway, runner
    • Recommendation: Standardize all metric labels to lowercase for consistency
  3. Hardcoded default values in middleware (guard/src/middleware.rs:20-34)

    • The middleware function returns hardcoded values (10000 requests, 2000 concurrent, etc.)
    • The comment says "In a real implementation, you would look up actor-specific middleware settings"
    • Question: Is this intentional for this PR or should this be configurable?
  4. Rate limiting logic change may affect production (proxy_service.rs:510-548)

    • Changed from (actor_id, ip_addr) cache key to just ip_addr
    • This means rate limits are now only per-IP, not per-actor-per-IP
    • Security concern: This could allow clients to hit the same actor from different IPs to bypass limits, or block legitimate traffic to different actors from the same IP
    • The default values also changed significantly: the old actor-specific limits were 100 req/60s with 20 concurrent; the new global limits are 10000 req/60s with 2000 concurrent
    • Recommendation: Verify this is the intended behavior change and document it in the PR description
  5. Removed actor_id from error messages (errors.rs:6-14)

    • The RateLimit error no longer includes actor_id in the message
    • This makes debugging rate limit issues harder
    • Recommendation: Consider keeping the actor info in logs even if not in the error struct
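
As a concrete illustration of the .context suggestion in item 1 above, here is a minimal sketch, assuming StandaloneCtx::new returns an anyhow-compatible Result (the types below are illustrative stand-ins, not the real gasoline definitions):

```rust
use anyhow::{Context, Result};

// Illustrative stand-ins for the real gasoline types.
#[derive(Clone, Copy, Debug)]
struct Id(u128);

struct StandaloneCtx {
    ray_id: Id,
    req_id: Id,
    from_workflow: bool,
}

impl StandaloneCtx {
    fn new(ray_id: Id, req_id: Id, from_workflow: bool) -> Result<Self> {
        Ok(Self { ray_id, req_id, from_workflow })
    }

    // The suggested fix: attach context to the error instead of
    // propagating it bare with `?`.
    fn with_ray(&self, ray_id: Id, req_id: Id) -> Result<Self> {
        StandaloneCtx::new(ray_id, req_id, self.from_workflow)
            .context("failed to clone standalone ctx with new ray")
    }
}
```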

Potential Bugs

  1. Routing function signature mismatch (guard/src/routing/mod.rs:40)

    • The closure calls ctx.with_ray(ray_id, req_id).unwrap()
    • Using .unwrap() in production routing code is dangerous - it will panic on failure
    • Fix: Change to proper error handling (see the sketch after this list)
  2. No WebSocket close frame change verification (proxy_service.rs:2405)

    • Changed from Option to Id for ray_id parameter
    • The function err_to_close_frame now always includes ray_id in the close reason
    • Verify this does not break WebSocket clients that expect the old format
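
A sketch of the fix for item 1, reusing the hypothetical StandaloneCtx from the sketch above (the real closure lives in guard/src/routing/mod.rs and may differ):

```rust
use anyhow::Result;

fn route(ctx: &StandaloneCtx, ray_id: Id, req_id: Id) -> Result<()> {
    // Before: `ctx.with_ray(ray_id, req_id).unwrap()` panics if context
    // creation fails, taking the routing task down with it.
    // After: propagate the error with `?` so the caller can map it to a
    // proper error response instead.
    let _ctx = ctx.with_ray(ray_id, req_id)?;
    Ok(())
}
```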

Performance Considerations

  1. Cache key simplification - Changing from tuple keys (Id, IpAddr) to just IpAddr should improve cache performance
  2. Removed analytics overhead - Eliminating the task spawned to insert ClickHouse events on every request is a good performance win

Security Concerns

  1. Rate limiting scope change - Per-IP instead of per-actor-per-IP could have security implications
  2. Error information leakage - Verify that removing actor_id from errors does not expose information about what actors exist

Test Coverage

  • Missing: No new tests added for the with_ray method
  • Missing: No tests verifying the new rate limiting behavior
  • Recommendation: Add tests verifying that rate limits work correctly with just IP addresses (see the sketch below), that the with_ray method preserves the from_workflow flag, and that WebSocket close frames include ray_id correctly
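
A sketch of the rate-limit test, using a hypothetical fixed-window limiter keyed by IpAddr to mirror the new per-IP behavior (names are illustrative, not guard-core's actual API):

```rust
use std::collections::HashMap;
use std::net::{IpAddr, Ipv4Addr};

// Hypothetical fixed-window limiter keyed by client IP only.
struct RateLimiter {
    limit: u64,
    counts: HashMap<IpAddr, u64>,
}

impl RateLimiter {
    fn new(limit: u64) -> Self {
        Self { limit, counts: HashMap::new() }
    }

    // Counts a request and reports whether it is within the limit.
    fn allow(&mut self, ip: IpAddr) -> bool {
        let count = self.counts.entry(ip).or_insert(0);
        *count += 1;
        *count <= self.limit
    }
}

#[test]
fn rate_limit_is_keyed_by_ip_alone() {
    let mut limiter = RateLimiter::new(10_000);
    let ip = IpAddr::V4(Ipv4Addr::new(1, 2, 3, 4));
    // Requests to any actor from the same IP draw from one shared bucket.
    for _ in 0..10_000 {
        assert!(limiter.allow(ip));
    }
    assert!(!limiter.allow(ip), "request 10,001 from the same IP is rejected");
}
```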

Documentation

  • Missing: PR description does not explain why the request context tracking was removed or what the implications are
  • Missing: No migration notes for any breaking API changes
  • Recommendation: Add a brief explanation of the architectural change and its benefits

Style and Conventions

  1. Per CLAUDE.md, comments should be complete sentences - most comments follow the guidelines
  2. No glob imports detected - good adherence to CLAUDE.md guidelines

Summary

This is a solid cleanup PR that removes unused functionality. The main concerns are:

  1. Critical: Fix the .unwrap() in the routing function (routing/mod.rs:40)
  2. Important: Verify the rate limiting behavior change is intentional and document it
  3. Nice to have: Standardize metric labels, add error context, improve test coverage

The architectural simplification is good, but make sure the rate limiting changes align with product requirements before merging.

@MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from c079c96 to 9e8cf3d on January 14, 2026 at 02:05
@MasterPtato force-pushed the 01-12-fix_add_ready_chunks_to_worker_bumps_bumps_per_tick_metric branch from 3df8f27 to a2b521a on January 14, 2026 at 02:05
@MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from 9e8cf3d to 3008b28 on January 14, 2026 at 02:53
@MasterPtato force-pushed the 01-12-fix_add_ready_chunks_to_worker_bumps_bumps_per_tick_metric branch from a2b521a to 061a672 on January 14, 2026 at 02:53
Contributor Author

@NathanFlurry make sure to double check that re-enabling rate limiters and route cache here (now that it's not per actor id but just IP) is fine

@MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from 3008b28 to 70900c0 on January 14, 2026 at 03:06
@MasterPtato force-pushed the 01-12-fix_add_ready_chunks_to_worker_bumps_bumps_per_tick_metric branch from 061a672 to 535af14 on January 14, 2026 at 03:06
Member

@NathanFlurry left a comment


crank limit to nutso levels

@MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from 70900c0 to 5f50433 on January 14, 2026 at 19:47
@MasterPtato force-pushed the 01-12-fix_add_ready_chunks_to_worker_bumps_bumps_per_tick_metric branch from 535af14 to 24b4824 on January 14, 2026 at 19:47
@MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from 5f50433 to d4e5e20 on January 14, 2026 at 20:01
@MasterPtato force-pushed the 01-12-fix_add_ready_chunks_to_worker_bumps_bumps_per_tick_metric branch from 24b4824 to 45cdaf0 on January 14, 2026 at 20:01
Member

@NathanFlurry left a comment


rate limit looks unchanged??

@MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from d4e5e20 to f831199 on January 14, 2026 at 22:45
Contributor Author

I'm not sure where the changes went; they should be there now.

@MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from f831199 to c90f0b8 on January 14, 2026 at 22:47
@claude

claude bot commented Jan 14, 2026

Pull Request Review: fix(guard): clean up guard

Summary

This PR significantly simplifies the guard proxy service by removing the request analytics/context tracking system and refactoring the middleware and routing layers to pass ray_id and req_id directly instead of maintaining per-actor state. The changes remove 532 lines while adding 166, a net reduction of roughly 366 lines of code.


✅ Strengths

  1. Excellent Code Reduction: Removing 532 lines while adding only 166 shows effective simplification
  2. Cleaner Interface: Passing ray_id/req_id explicitly to CustomServeTrait methods is clearer than implicit context
  3. Simplified Rate Limiting: Removing per-actor rate limiting in favor of IP-based limiting reduces complexity
  4. Good Use of New Context Method: The StandaloneCtx::with_ray() method (gasoline/src/ctx/standalone.rs:70-84) is a clean addition

🔴 Critical Issues

1. Removed Analytics Without Migration Path (proxy_service.rs)

The entire RequestContext and ClickHouse analytics system was removed:

  • Deleted request_context.rs (184 lines)
  • Removed clickhouse_inserter parameter from ProxyState
  • No migration notes or feature flag

Impact: Complete loss of request analytics data. If this data was used for monitoring, debugging, or billing, this is a breaking change.

Recommendation:

  • Document why analytics were removed in the PR description
  • If analytics are still needed, consider re-implementing them in a simpler form
  • Add a TODO or issue reference if this will be replaced

@claude

claude bot commented Jan 14, 2026

🔴 Critical Issues (continued)

2. Middleware Function Signature Breaking Change (proxy_service.rs:210-217)

The middleware function signature changed, removing the actor_id parameter.

Impact: This breaks all existing middleware implementations. While guard/src/middleware.rs was updated, any external implementations will break.

Recommendation: Document this breaking change in the commit message or PR description

3. Rate Limiting Behavior Change (proxy_service.rs:510-548)

Previously, rate limiting was per-actor + IP. Now it's only per-IP.

Impact:

  • Users hitting multiple actors from the same IP now share a rate limit pool
  • Could allow one user to DOS multiple actors
  • Removes ability to have per-actor rate limit policies

Recommendation: Reconsider if this is the desired behavior. If per-actor limits are needed, you may need to keep actor_id in the cache key.
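
For illustration, the change amounts to the following shift in cache-key shape (type names below are hypothetical; only the key shapes come from the diff described above):

```rust
use std::net::IpAddr;

// `Id` stands in here for Rivet's actor ID type.
type Id = u128;

// Before: rate-limit state was keyed per actor *and* per client IP,
// so each actor had an independent bucket for every IP.
type OldRateLimitKey = (Id, IpAddr);

// After: state is keyed by client IP alone, so one IP shares a single
// bucket across every actor it talks to.
type NewRateLimitKey = IpAddr;
```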

4. Default Middleware Config Change (proxy_service.rs:455-488)

The fallback middleware config changed significantly:

  • Rate limit: 100 → 10000 requests per 60s (100x increase)
  • Max in-flight: 20 → 2000 (100x increase)
  • Max retry attempts: 3 → 7
  • Initial retry interval: 100ms → 150ms

Impact: This is a 100x increase in rate limits and concurrent requests. While more permissive, it may expose services to higher load than expected and could mask performance issues during development.

Recommendation: Document this change and ensure it aligns with intended behavior.
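
For reference, the new fallback values amount to something like the following sketch (the struct and field names are hypothetical; the values are the ones cited above):

```rust
use std::time::Duration;

// Hypothetical shape of the fallback middleware config.
struct MiddlewareConfig {
    rate_limit_requests: u64,
    rate_limit_period: Duration,
    max_in_flight: u64,
    max_retry_attempts: u32,
    initial_retry_interval: Duration,
}

impl Default for MiddlewareConfig {
    fn default() -> Self {
        Self {
            rate_limit_requests: 10_000,                        // was 100
            rate_limit_period: Duration::from_secs(60),
            max_in_flight: 2_000,                               // was 20
            max_retry_attempts: 7,                              // was 3
            initial_retry_interval: Duration::from_millis(150), // was 100ms
        }
    }
}
```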

@claude

claude bot commented Jan 14, 2026

⚠️ Important Issues

5. Error Message Breaking Change (errors.rs:6-15)

Removed actor_id from RateLimit error message.

Impact: Reduces debuggability. When investigating rate limit issues, knowing which actor was targeted is valuable.

Recommendation: Consider keeping actor_id in the error if it's available from context.

6. Inconsistent Metric Labels (routing/mod.rs:75, 130, 147)

Different cases used for the same metric:

  • Line 75: "gateway" (lowercase)
  • Line 130: "GATEWAY" (uppercase)
  • Line 147: "API" (uppercase)
  • Line 138: "api" (lowercase)

Impact: Metrics will be split across different labels, making monitoring harder.

Recommendation: Standardize to lowercase (e.g., "gateway", "api", "runner").

7. Missing Error Handling in Routing (routing/mod.rs:40)

Uses .unwrap() when creating context with ray IDs.

Impact: Using .unwrap() in production code can cause panics. If with_ray() can fail, this should be handled gracefully.

Recommendation: Use ? operator and return a proper error.

8. Potential Denial of Service (proxy_service.rs:345)

The in-flight request cache has a very high limit of 10,000,000 entries.

Impact: 10 million entries could consume significant memory. Each entry stores at minimum a request ID (likely 16-32 bytes) plus cache overhead.

Recommendation: Consider if this limit is realistic or if it should be lower with proper backpressure.

@MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from c90f0b8 to 9d72300 on January 14, 2026 at 22:52
@claude

claude bot commented Jan 14, 2026

💡 Suggestions & 🧪 Testing Concerns

9. Documentation Needed for with_ray (gasoline/src/ctx/standalone.rs:70)

The new with_ray() method lacks doc comments explaining:

  • When to use it vs new()
  • Why from_workflow is preserved
  • What scenarios might cause it to fail
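
A sketch of the requested doc comment, written against a hypothetical signature (fields and construction details are elided):

```rust
use anyhow::Result;

pub struct Id(pub u128);

pub struct StandaloneCtx {
    from_workflow: bool,
}

impl StandaloneCtx {
    /// Clones this context with a new `ray_id` and `req_id`.
    ///
    /// Prefer this over `StandaloneCtx::new` when rebinding an existing
    /// context to a fresh request; the `from_workflow` flag is carried
    /// over so workflow-originated requests keep their provenance.
    ///
    /// # Errors
    ///
    /// Fails if the underlying context cannot be constructed.
    pub fn with_ray(&self, _ray_id: Id, _req_id: Id) -> Result<StandaloneCtx> {
        // Construction details elided in this sketch.
        Ok(StandaloneCtx { from_workflow: self.from_workflow })
    }
}
```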

10. Comment Style Improvements (custom_serve.rs:38-39)

Per CLAUDE.md, write comments as complete sentences. Avoid fragmented structures.

🧪 Testing Concerns

No test coverage found for guard-core changes. Given the significant behavioral changes to:

  • Rate limiting (per-actor → per-IP)
  • Middleware configuration (100x limit increase)
  • Retry behavior (3 → 7 attempts)

Recommendation: Add integration tests covering:

  • Rate limiting behavior with multiple IPs
  • Middleware config application
  • Retry logic with the new defaults
  • Error handling for with_ray() failures

📋 Style & Conventions

Generally follows CLAUDE.md guidelines:

  • ✅ Uses workspace dependencies correctly
  • ✅ Structured logging with tracing
  • ✅ Proper error handling with RivetError derive
  • ⚠️ Comment style could be improved in a few places
  • ⚠️ Some unwraps should use ? operator

@MasterPtato force-pushed the 01-12-fix_add_ready_chunks_to_worker_bumps_bumps_per_tick_metric branch from 45cdaf0 to 445bb05 on January 14, 2026 at 22:52
@claude

claude bot commented Jan 14, 2026

PR Review: fix(guard): clean up guard

Summary

This PR removes the RequestContext system (184 lines deleted) and associated ClickHouse analytics tracking from guard-core, simplifying the proxy service significantly. The main changes involve removing actor_id from rate limiting/tracking and passing ray_id/req_id directly through the call chain instead of bundling them in a context object.


✅ Positive Changes

1. Significant Complexity Reduction

  • Removed 532 lines (vs 166 added) - net reduction of ~366 lines
  • Eliminated the RequestContext abstraction layer which was tracking 20+ fields
  • Simplified the CustomServeTrait interface

2. Cleaner API Design

  • Direct parameter passing (ray_id, req_id) is more explicit than bundled context
  • with_ray() method in gasoline/StandaloneCtx provides clean context cloning

3. Better Separation of Concerns

  • Analytics concerns removed from the proxy layer
  • Rate limiting simplified to use IP address only

🔍 Issues & Concerns

1. Loss of Analytics Data (CRITICAL)

Issue: All analytics tracking has been removed with no replacement mechanism visible in this PR.

Questions:

  • Is analytics tracking being moved elsewhere, or is this data no longer needed?
  • Were the following metrics valuable? If so, how will they be captured now?
    • Request/response body sizes, Service IP addresses, Response duration
    • User agent strings, Referrer information, Response content types

Recommendation: If analytics are still needed, clarify the migration path.


2. Removal of Actor-Specific Rate Limiting

Issue: Rate limiting is now only per-IP, not per-actor-per-IP.

Impact:

  • A single IP can now consume the rate limit across ALL actors (10k req/60s total)
  • Before: Each actor had independent rate limits (100 req/60s per actor)
  • This could enable cross-actor DoS from a single IP

Example: Client at IP 1.2.3.4 sends 10k requests to actor_a (hits limit) and can no longer send requests to actor_b, actor_c, etc.

Questions:

  • Was the per-actor rate limiting causing issues?
  • Is IP-only limiting sufficient for your threat model?
  • Are there upstream rate limiters handling per-actor limits?

3. Changed Default Rate Limits

The defaults when middleware returns NotFound changed significantly:

Metric          Before   After    Change
requests        100      10000    100x increase
max_in_flight   20       2000     100x increase
max_attempts    3        7        2.3x increase

Questions:

  • Was the old limit (100 req/min) too restrictive in practice?
  • Are these new defaults aligned with your infrastructure capacity?
  • With the 100x increase + removal of per-actor limiting, what prevents resource exhaustion?

4. Inconsistent Metrics Labels

Issue: Inconsistent casing in metric labels at routing/mod.rs:75, 130, 138, 147.

Lines use both "gateway" (lowercase) and "GATEWAY" (uppercase), same for "api" vs "API".

Recommendation: Use consistent lowercase labels: gateway, api, runner, none.
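
One low-effort way to enforce consistency, sketched with hypothetical constant names:

```rust
// Define each metric label once and reference the constants at every
// call site so casing cannot drift.
pub const LABEL_GATEWAY: &str = "gateway";
pub const LABEL_API: &str = "api";
pub const LABEL_RUNNER: &str = "runner";
pub const LABEL_NONE: &str = "none";
```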


5. Error Message Simplification Lost Context

Issue: Rate limit errors no longer include which actor was targeted.

Before: "Too many requests to path (actor_id: Some(id)) from IP"
After: "Too many requests to path from IP"

Impact: Debugging production rate limit issues becomes harder.


🎯 Code Quality Observations

Good Practices Followed

✅ Removed unused imports and the entire request_context.rs module
✅ Updated function signatures consistently across all callers
✅ Added with_ray() helper method to StandaloneCtx for clean context creation
✅ Maintained tracing instrumentation

Minor Issues

  • Multiple functions now have unused ray_id and req_id parameters (e.g., api_public.rs:23-24). This is likely fine if the parameters are required by the trait signature.

🔒 Security Analysis

Potential Security Concerns

  1. ⚠️ Rate Limit Bypass via Actor Enumeration - Single IP can now DoS all actors instead of being limited per-actor
  2. ✅ No Obvious Vulnerabilities Introduced - No SQL injection, XSS, or command injection vectors added

🧪 Testing Recommendations

This PR would benefit from tests covering:

  1. Rate limiting behavior - Verify IP-based rate limiting and that requests to different actors share the same bucket
  2. Context propagation - Verify ray_id and req_id flow correctly
  3. Error messages - Verify rate limit errors provide sufficient debugging info

📝 Questions for Author

  1. Analytics: Where is request analytics tracking moving? Is this data no longer needed?
  2. Per-Actor Rate Limiting: Was this causing issues? Is IP-only limiting sufficient?
  3. Default Limits: Are the 100x increased defaults (10k req/min, 2k concurrent) intentional and tested?
  4. Migration: Are there any production configurations that relied on the old rate limit values?

✅ Approval Recommendation

APPROVE WITH CONCERNS - The code quality is good and the simplification is valuable, but:

  1. ⚠️ Must address: Analytics data loss - clarify if intentional or document migration
  2. ⚠️ Must address: Rate limit change impact - verify this won't cause resource issues
  3. ⚠️ Should fix: Inconsistent metric labels
  4. 📋 Consider: Document the behavioral changes (per-actor → per-IP limiting)

Overall: This is a solid refactoring that removes significant complexity. The main concerns are around the behavioral changes (per-actor → per-IP rate limiting, 100x limit increases) and the loss of analytics tracking. These may all be intentional, but they represent significant changes that should be explicitly acknowledged and tested.

@MasterPtato force-pushed the 01-12-fix_add_ready_chunks_to_worker_bumps_bumps_per_tick_metric branch from 445bb05 to a084bd6 on January 14, 2026 at 23:02
@MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from 9d72300 to 875a7a4 on January 14, 2026 at 23:02
@MasterPtato force-pushed the 01-12-fix_add_ready_chunks_to_worker_bumps_bumps_per_tick_metric branch from a084bd6 to fab1f2d on January 14, 2026 at 23:07
@MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from 875a7a4 to 920b1d9 on January 14, 2026 at 23:07
@MasterPtato force-pushed the 01-12-fix_add_ready_chunks_to_worker_bumps_bumps_per_tick_metric branch from fab1f2d to c46083b on January 14, 2026 at 23:39
@MasterPtato force-pushed the 01-12-fix_guard_clean_up_guard branch from 920b1d9 to 4a1bc29 on January 14, 2026 at 23:39
@graphite-app
Contributor

graphite-app bot commented Jan 14, 2026

Merge activity

  • Jan 14, 11:40 PM UTC: MasterPtato added this pull request to the Graphite merge queue.
  • Jan 14, 11:41 PM UTC: CI is running for this pull request on a draft pull request (#3908) due to your merge queue CI optimization settings.
  • Jan 14, 11:42 PM UTC: Merged by the Graphite merge queue via draft PR: #3908.

graphite-app bot pushed a commit that referenced this pull request Jan 14, 2026
@graphite-app bot closed this on Jan 14, 2026
@graphite-app bot deleted the 01-12-fix_guard_clean_up_guard branch on January 14, 2026 at 23:42